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ABSTRACT 

An  important  first  step  in  developing  a  cross-lingual 
question  answering  system  is  to  understand  whether 
techniques  developed  with  English  text  will  also  work  with 
other  languages,  such  as  Chinese.  The  Marsha  Chinese 
question  answering  system  described  in  this  paper  uses 
techniques  similar  to  those  used  in  the  English  systems 
developed  for  TREC.  Marsha  consists  of  three  main 
components:  the  query  processing  module,  the  Hanquery 
search  engine,  and  the  answer  extraction  module.  It  also 
contains  some  specific  techniques  dealing  with  Chinese 
language  characteristics,  such  as  word  segmentation  and 
ordinals  processing.  Evaluation  of  the  system  is  done  using 
a  method  based  on  the  TREC  question-answering  track. 
The  results  of  the  evaluation  show  that  the  performance  of 
Marsha  is  comparable  to  some  English  question  answering 
systems  in  TREC  8  track.  An  English  language  version  of 
Marsha  further  indicates  that  the  heuristics  used  are 
applicable  to  the  English  question  answering  task. 
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1.  Introduction 

A  number  of  techniques  for  “question  answering”  have 
recently  been  evaluated  both  in  the  TREC  environment 
(Voorhees  and  Harman,  1999)  and  in  the  DARPA  TIDES 
program.  In  the  standard  approach  to  information  retrieval, 
relevant  text  documents  are  retrieved  in  response  to  a 
query.  The  parts  of  those  documents  that  may  contain  the 
most  useful  information  or  even  the  actual  answer  to  the 
query  are  typically  indicated  by  highlighting  occurrences  of 
query  words  in  the  text.  In  contrast,  the  task  of  a  question¬ 
answering  system  is  to  identify  text  passages  containing  the 
relevant  information  and,  if  possible,  extract  the  actual 
answer  to  the  query.  Question  answering  has  a  long  history 
in  natural  language  processing,  and  Salton’s  first  book 
(Salton,  1968)  contains  a  detailed  discussion  of  the 
relationship  between  information  retrieval  and  question¬ 
answering  systems.  The  focus  in  recent  research  has  been 
on  extracting  answers  from  very  large  text  databases  and 
many  of  the  techniques  use  search  technology  as  a  major 
component.  A  significant  number  of  the  queries  used  in 
information  retrieval  experiments  are  questions,  for 
example,  TREC  topic  338  “What  adverse  effects  have 
people  experienced  while  taking  aspirin  repeatedly?”  and 
topic  308  “What  are  the  advantages  and/or  disadvantages  of 
tooth  implants?”  In  question-answering  experiments,  the 
queries  tend  to  be  more  restricted  questions,  where  answers 
are  likely  to  be  found  in  a  single  text  passage,  for  example, 
TREC  question-answering  question  1 1  “Who  was  President 
Cleveland’s  wife?”  and  question  14  “What  country  is  the 
biggest  producer  of  Tungsten?” 

The  TREC  question-answering  experiments  have,  to  date, 
used  only  English  text.  As  the  first  step  towards  our  goal  of 
cross-lingual  question  answering,  we  investigated  whether 
the  general  approaches  to  question  answering  that  have 
been  used  in  English  will  also  be  effective  for  Chinese. 
Although  it  is  now  well  known  that  statistical  information 
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retrieval  techniques  are  effective  in  many  languages,  earlier 
research,  such  as  Fujii  and  Croft  (1993,  1999),  was  helpful 
in  pointing  out  which  techniques  were  particularly  useful 
for  languages  like  Japanese.  This  research  was  designed  to 
provide  similar  information  for  question  answering.  In  the 
next  section,  we  describe  the  components  of  the  Chinese 
question  answering  system  (Marsha)  and  the  algorithm  used 
to  determine  answers.  In  section  3,  we  describe  an 
evaluation  of  the  system  using  queries  obtained  from 
Chinese  students  and  the  TREC-9  Chinese  cross-lingual 
database  (164,779  documents  from  the  Peoples  Daily  and 
the  Xing-Hua  news  agencies  in  the  period  1991-1995). 

2.  Overview  of  the  Marsha  Question 
Answering  System 

The  Chinese  question-answering  system  consists  of  three 
main  components.  These  are  the  query  processing  module, 
the  Hanquery  search  engine,  and  the  answer  extraction 
module.  The  query  processing  module  recognizes  known 
question  types  and  formulates  queries  for  the  search  engine. 
The  search  engine  retrieves  candidate  texts  from  a  large 
database.  The  answer  extraction  module  identifies  text 
passages  that  are  likely  to  contain  answers  and  extracts 
answers,  if  possible,  from  these  passages.  This  system 
architecture  is  very  similar  to  other  question-answering 
systems  described  in  the  literature. 

More  specifically,  the  query  processing  module  carries  out 
the  following  steps: 

(1)  The  query  is  matched  with  templates  to  decide  the 
question  type  and  the  “question  words”  in  the  query.  We 
define  9  question  types.  Most  of  these  correspond  to  typical 
named  entity  classes  used  in  information  extraction  systems. 
For  each  question  type,  there  are  one  or  more  templates. 
Currently  there  are  170  templates.  If  more  than  one 
template  matches  the  question,  we  pick  the  longest  match. 
For  example,  a  question  may  include  (how  many 

dollars).  Then  both  (how  many  dollars)  and 

(how  many)  will  match  the  question.  In  this  case,  we  will 
pick  and  assign  “MONEY”  to  the  question  type. 

The  following  table  gives  examples  for  each  question  type: 

TEMPLATE  QUESTION  TRANSLATION 

TYPE 

If  PERSON  which  person 

If |S LOCATION  which  city 


ORGANIZATIO 

N 

what  organization 

DATE 

what  date 

ff-AHtl'hl 

TIME 

what  time 

MONEY 

how  many  dollars 

A 

PERCENTAGE 

what  is  the 

percentage 

NUMBER 

how  many 

OTHER 

what  is  the  meaning 
of 

(2)  Question  words  are  removed  from  the  query.  This  is  a 
form  of  “stop  word”  removal.  Words  like  “lf|5AA” 
(which  person)  are  removed  from  the  query  since  they  are 
unlikely  to  occur  in  relevant  text. 

(3)  Named  entities  in  the  query  are  marked  up  using  BBN’s 
IdentiFinder  system.  A  named  entity  is  kept  as  a  word  after 
segmentation. 

(5)  The  query  is  segmented  to  identify  Chinese  words. 

(6)  Stop  words  are  removed. 

(7)  The  query  is  formulated  for  the  Hanquery  search  engine. 
Hanquery  is  the  Chinese  version  of  Inquery  (Broglio, 
Callan  and  Croft,  1996)  and  uses  the  Inquery  query 
language  that  supports  the  specification  of  a  variety  of 
evidence  combination  methods.  To  support  question 
answering,  documents  containing  most  of  the  query  words 
were  strongly  preferred.  If  the  number  of  query  words  left 
after  the  previous  steps  is  greater  than  4,  then  the  operator 
#and  (a  probabilistic  AND)  is  used.  Otherwise,  the 
probabilistic  passage  operator  #UWn  (unordered  window) 
is  used.  The  parameter  n  is  set  to  twice  the  number  of  words 
in  the  query. 

Hanquery  is  used  to  retrieve  the  top  10  ranked  documents. 
The  answer  extraction  module  then  goes  through  the 
following  steps: 

(8)  IdentiFinder  is  used  to  mark  up  named  entities  in  the 
documents. 


(9)  Passages  are  constructed  from  document  sentences.  We 
used  passages  based  on  sentence  pairs,  with  a  1 -sentence 
overlap. 

(10)  Scores  are  calculated  for  each  passage.  The  score  is 
based  on  five  heuristics: 

First  Rule: 

Assign  0  to  a  passage  if  no  expected  name  entity  is  present. 
Second  Rule: 

Calculate  the  number  of  match  words  in  a  passage. 

Assign  0  to  the  passage  if  the  number  of  matching  words  is 
less  than  the  threshold.  Otherwise,  the  score  of  this  passage 
is  equal  to  the  number  of  matching  words  (count_m). 

The  threshold  is  defined  as  follows: 

threshold  =  count_q  if  count_q<4 

threshold  =  count_q/2.0+l.0  if  A<=count_q<=% 

threshold  =  count_q/3. 0+2.0  if  count _q>% 

count_q  is  the  number  of  words  in  the  query. 

Third  Rule: 

Add  0.5  to  score  if  all  matching  words  are  within  one 
sentence. 

Fourth  Rule: 

Add  0.5  to  score  if  all  matching  words  are  in  the  same  order 
as  they  are  in  the  original  question. 

Fifth  Rule: 

score  =  score  +  count_m/(size  of  matching  window) 

(11)  Pick  the  best  passage  for  each  document  and  rank  them. 

(12)  Extract  the  answer  from  the  top  passage: 

Find  all  candidates  according  to  the  question  type.  For 
example,  if  the  question  type  is  FOCATION,  then  each 
location  marked  by  IdentiFinder  is  an  answer  candidate.  An 
answer  candidate  is  removed  if  it  appears  in  the  original 
question.  If  no  candidate  answer  is  found,  no  answer  is 
returned. 

Calculate  the  average  distance  between  an  answer  candidate 
and  the  location  of  each  matching  word  in  the  passage. 


Pick  the  answer  candidate  that  has  the  smallest  average 
distance  as  the  final  answer. 


3.  Evaluating  the  System 

We  used  51  queries  to  do  the  initial  evaluation  of  the 
question-answering  system.  We  selected  26  queries  from 
240  questions  collected  from  Chinese  students  in  our 
department,  because  only  these  had  answers  in  the  test 
collection.  The  other  25  queries  were  constructed  by  either 
reformulating  a  question  or  asking  a  slightly  different 
question.  For  example,  given  the  question  “which  city  is  the 
biggest  city  in  China?”  we  also  generated  the  questions 
“where  is  the  biggest  city  in  China?”  and  “which  city  is  the 
biggest  city  in  the  world?”. 

The  results  for  these  queries  were  evaluated  in  a  similar,  but 
not  identical  way  to  the  TREC  question-answering  track. 
An  “answer”  in  this  system  corresponds  to  the  50  byte 
responses  in  TREC  and  passages  are  approximately 
equivalent  to  the  250  byte  TREC  responses. 

For  33  of  51  queries,  the  system  suggested  answers.  24  of 
the  33  were  correct.  For  these  24,  the  “reciprocal  rank”  is  1, 
since  only  the  top  ranked  passage  is  used  to  extract  answers. 
Restricting  the  answer  extraction  to  the  top  ranked  passage 
also  means  that  the  other  27  queries  have  reciprocal  rank 
values  of  0.  In  TREC,  the  reciprocal  ranks  are  calculated 
using  the  highest  rank  of  the  correct  answer  (up  to  5).  In  our 
case,  using  only  the  top  passage  means  that  the  mean 
reciprocal  rank  of  0.47  is  a  lower  bound  for  the  result  of  the 
50  byte  task. 

As  an  example,  the  question 

(Which  city  is  the  biggest  city  in  China?),  the  answer 
returned  is  _Ll?§  (Shanghai).  In  the  top  ranked  passage, 
“China”  and  “Shanghai”  are  the  two  answer  candidates  that 
have  the  smallest  distances.  “Shanghai”  is  chosen  as  the 
final  answer  since  “China”  appears  in  the  original  question. 

As  an  example  of  an  incorrect  response,  the  question 

(In  which  year  did  Jun  Xie  defeat  a  Russian  player  and 
win  the  world  chess  championship  for  the  first  time?) 
produced  an  answer  of  (today).  There  were  two 

candidate  answers  in  the  top  passage,  “October  18”  and 
“today”.  Both  were  marked  as  DATE  by  Identifinder,  but 
“today”  was  closer  to  the  matching  words.  This  indicates 
the  need  for  more  date  normalization  and  better  entity 
classification  in  the  system. 

For  44  queries,  the  correct  answer  was  found  in  the  top- 
ranked  passage.  Even  if  the  other  queries  are  given  a 


reciprocal  rank  of  0,  this  gives  a  mean  reciprocal  rank  of 
0.86  for  a  task  similar  to  the  250  byte  TREC  task.  In  fact, 
the  correct  answer  for  4  other  queries  was  found  in  the  top 
5  passages,  so  the  mean  reciprocal  rank  would  be  somewhat 
higher.  For  2  of  the  remaining  3  queries,  Hanquery  did  not 
retrieve  a  document  in  the  top  10  that  contained  an  answer, 
so  answer  extraction  could  not  work. 


4.  Further  Improvements 

These  results,  although  preliminary,  are  promising.  We 
have  made  a  number  of  improvements  in  the  new  version 
(v2)  of  the  system.  Some  of  these  are  described  in  this 
section. 

One  of  the  changes  is  designed  to  improve  the  system’s 
ability  to  extract  answers  for  the  questions  that  ask  for  a 
number.  A  number  recognizer  was  developed  to  recognize 
numbers  in  Chinese  documents.  The  numbers  here  are 
numbers  other  than  DATE,  MONEY  and  PERCENTAGE 
that  are  recognized  by  IdentiFinder.  The  version  of 
IdentiFinder  used  in  our  system  can  only  mark  up  seven 
types  of  name  entities  and  this  limits  the  system’s  ability  to 
answer  other  types  of  questions.  The  number  recognizer  is 
the  first  example  of  the  type  of  refinement  to  named  entity 
recognition  that  must  be  done  for  better  performance. 

An  example  of  a  question  requiring  a  numeric  answer  is: 

A ’fi H  H  ?  (What  is  the  number  of 
Clinton’s  presidency?)”.  This  question  could  be  answered 
in  Marsha  v2  by  extracting  the  marked  up  number  from  the 
best  passage  in  the  answer  extraction  part,  while  Marsha  vl 
could  only  return  the  top  5  passages  that  were  likely  to  have 
the  answer  to  this  question. 

Another  improvement  relates  to  the  best  matching  window 
of  a  passage.  The  size  of  the  matching  window  in  each 
passage  is  an  important  part  of  calculating  the  belief  score 
for  the  passage.  Locating  the  best  matching  window  is  also 
important  in  the  answer-extraction  processing  because  the 
final  answer  picked  is  the  candidate  that  has  the  smallest 
average  distance  from  the  matching  window.  The  best 
matching  window  of  a  passage  here  is  the  window  that  has 
the  most  query  words  in  it  and  has  the  smallest  window 
size.  In  the  previous  version  of  our  system,  we  only 
consider  the  first  occurrence  of  each  query  word  in  a 
passage  and  index  the  position  accordingly.  The  matching 
window  is  thus  from  the  word  of  the  smallest  index  to  the 
word  of  the  largest  index  in  the  passage.  It  is  only  a  rough 
approximation  of  the  best  matching  window  though  it 
works  well  for  many  of  the  passages.  In  the  second  version 
of  Marsha,  we  developed  a  more  accurate  algorithm  to 
locate  the  best  matching  window  of  each  passage.  This 


change  helped  Marsha  v2  find  correct  answers  for  some 
questions  that  previously  failed.  The  following  is  an 
example  of  such  a  question. 

For  the  question  1)J,T  W  A  P A? 

(How  many  people  in  the  United  States  are  below  the 
poverty  line?)” 

The  best  passage  is  as  follows: 

9  2  8  B 

AKiBcATPI,  M^AAA-Ai^jjn  200^ 

A, 

3  3  5  8.  5  A>  tk  1  9  8  9  6  .  7  % ,  3^ Ail 

A  A  A  W 

This  passage  has  two  occurrences  of  query  word  “AH”. 
In  vl,  the  first  occurrence  of  “AH”  is  treated  as  the  start  of 
the  matching  window,  whereas  the  second  occurrence  is 
actually  the  start  of  the  best  matching  window.  There  are 
two  numbers  “  2  0  0  A  A”  (more  than  2  million)  and  “  3 
3  5  8.  5  A”  (33.585  million)  in  the  passage.  The  right 
answer  “3  3  5  8.  5  Tj”  (33.585  million)  is  nearer  to  the 
best  matching  window  and  “  2  0  0  A  A”  (more  than  2 
million)  is  nearer  to  the  estimated  matching  window. 
Therefore,  the  right  answer  can  be  extracted  after  correctly 
locating  the  best  matching  window. 

The  third  improvement  is  with  the  scoring  strategies  of 
passages.  Based  on  the  observation  that  the  size  of  the  best 
matching  window  of  a  passage  plays  a  more  important  role 
than  the  order  of  the  query  words  in  a  passage,  we  adjusted 
the  score  bonus  for  same  order  satisfaction  from  0.5  to 
0.05.  This  adjustment  makes  a  passage  with  a  smaller 
matching  window  get  a  higher  belief  score  than  a  passage 
that  satisfies  the  same  order  of  query  words  but  has  a  bigger 
matching  window.  As  an  example,  consider  the  question: 

— 'AAH/i^'^^?  (Who  was  the  first  president  in 
the  United  States?)”. 

Passage  1  is  the  passage  that  has  the  right  answer  “A tn  A 
Passage  1. 

“  1992Al2ij26B  MS)jA#pn:AA)A#pm:HPi 
glj  f  Ij  #xh:5#lm:  t!i  If-  “  1%  #ti:  A  H  fiti  tfclR  A  ^L#au:  jt 
A#rw:  A  H  “A  f  n  A  it!l#rw:  A  A  #‘®#rw: 

Passage  2. 


‘ii  H  af4^  B  jf  m  wa 

f°], 

Passage  1  and  Passage  2  both  have  all  query  words.  The 
size  of  the  best  matching  window  in  Passage  1  is  smaller 
than  that  in  Passage  2  while  query  words  in  Passage  2  have 
the  same  order  as  that  in  the  question.  The  scoring  strategy 
in  Marsha  v2  selects  Passage  1  and  extracts  the  correct 
answer  while  Marsha  vl  selected  Passage  2. 

Special  processing  of  ordinals  has  also  been  considered  in 
Marsha  v2.  Ordinals  in  Chinese  usually  start  with  the 
Chinese  character  and  are  followed  by  a  cardinal.  It  is 
better  to  retain  ordinals  as  single  words  during  the  query 
generation  in  order  to  retrieve  better  relevant  documents. 
However,  the  cardinals  (part  of  the  ordinals  in  Chinese)  in  a 
passage  are  marked  up  by  the  number  recognizer  for  they 
might  be  answer  candidates  for  questions  asking  for  a 
number.  Thus  ordinals  in  Chinese  need  special  care  in  a  QA 
system.  In  Marsha  v2,  ordinals  appearing  in  a  question  are 
first  retained  as  single  words  for  the  purpose  of  generating  a 
good  query  and  then  separated  in  the  post  processing  after 
relevant  documents  are  retrieved  to  avoid  answer 
candidates  being  ignored. 

5.  Comparison  with  English  Question 
Answering  Systems 

Some  techniques  used  in  Marsha  are  similar  to  the 
techniques  in  English  question  answering  systems 
developed  by  other  researchers.  The  template  matching  in 
Marsha  for  deciding  the  type  of  expected  answer  for  a 
question  is  basically  the  same  as  the  one  used  in  the 
GuruQA  (Prager  et  al.,  2000)  except  that  the  templates 
consist  of  Chinese  word  patterns  instead  of  English  word 
patterns.  Marsha  has  the  ability  of  providing  answers  to 
eight  types  of  questions:  PERSON,  LOCATION, 
ORGANIZATION,  DATE,  TIME,  MONEY, 
PERCENTAGE,  and  NUMBER.  The  first  seven  types 
correspond  to  the  named  entities  from  IdentiEinder 
developed  by  BBN.  We  developed  a  Chinese  number- 
recognizer  ourselves  which  marks  up  numbers  in  the 
passages  as  answer  candidates  for  questions  asking  for  a 
number.  The  number  could  be  represented  as  a  digit 
number  or  Chinese  characters.  David  A.  Hull  used  a  proper 
name  tagger  ThingEinder  developed  at  Xerox  in  his 
question  answering  system.  Live  of  the  answer  types 
correspond  to  the  types  of  proper  names  from  ThingEinder 
(Hull,  1999).  The  scoring  strategy  in  Marsha  is  similar  to 
the  computation  of  score  for  an  answer  window  in  the 
LASSO  QA  system  (Moldovan  et  al.,  1999)  in  terms  of  the 
factors  considered  in  the  computation.  Eactors  such  as  the 
number  of  matching  words  in  the  passage,  whether  all 


matching  words  in  the  same  sentence,  and  whether  the 
matching  words  in  the  passage  have  the  same  order  as  they 
are  in  the  question  are  common  to  LASSO  and  Marsha. 

We  have  also  implemented  an  English  language  version  of 
Marsha.  The  system  implements  the  answer  classes 
PERSON,  ORGANIZATION,  LOCATION,  and  DATE. 
Queries  are  generated  in  the  same  fashion  as  Marsha.  If 
there  are  any  phrases  in  the  input  query  (named  entities 
from  IdentiEinder,  quoted  strings)  these  are  added  to  an 
Inquery  query  in  a  #N  operator  all  inside  a  #sum  operator. 
Eor  example: 

Question:  "Who  is  the  author  of  "Bad  Bad  Leroy 
Brown" 

Inquery  query:  #sum(  #uw8(author  Bad  Bad  Leroy 
Brown)  #6(Bad  Bad  Leroy  Brown)) 

Where  N  is  number  of  terms  +  1  for  named  entities,  and 
number  of  terms  +  2  for  quoted  phrases.  If  a  query  retrieves 
no  documents,  a  “back  off’  query  uses  #sum  over  the  query 
terms,  with  phrases  dropped.  The  above  would  become 
#sum( author  Bad  Bad  Leroy  Brown). 

The  system  was  tested  against  the  TREC9  question 
answering  evaluation  questions.  The  mean  reciprocal  rank 
over  682/693  questions  was  0.300  with  396  questions  going 
unanswered.  The  U.Mass.  TREC9  (250  byte)  run  had  a 
score  of  0.367.  Considering  only  the  document  retrieval,  we 
find  a  document  containing  an  answer  for  471  of  the 
questions,  compared  to  477  for  the  official  TREC9  run 
which  used  expanded  queries.  This  indicates  that  the 
Marsha  heuristics  have  applicability  to  the  English  question 
answering  task  and  are  not  limited  to  the  Chinese  question 
answering  task. 


6.  Summary  and  Future  Work 

The  evaluations  on  Marsha,  although  preliminary,  indicate 
that  techniques  developed  for  question  answering  in  English 
are  also  effective  in  Chinese.  In  future  research,  we  plan  to 
continue  to  improve  these  techniques  and  carry  out  more 
careful  evaluations  to  establish  whether  there  are  any 
significant  differences  in  the  question-answering  task 
between  these  two  languages. 

The  evaluation  of  the  English  version  of  Marsha  indicates 
that  the  Marsha  heuristics  work  well  in  English  as  well  as  in 
Chinese.  We  now  plan  to  incorporate  these  techniques  in  a 
cross-lingual  question-answering  system  for  English  and 
Chinese.  By  using  two  systems  with  similar  question 
processing  strategies,  we  hope  to  exploit  the  query 
templates  to  produce  accurate  question  translations. 


We  have  also  started  to  develop  a  probabilistic  model  of 
question  answering  using  the  language  model  approach 
(Ponte  and  Croft,  1998).  This  type  of  model  will  be 
essential  for  extending  the  capability  of  QA  systems  beyond 
a  few  common  query  forms. 
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